-
Notifications
You must be signed in to change notification settings - Fork 27k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite push_to_hub to use upload_files #18366
Conversation
The documentation is not available anymore as the PR was closed or merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also cc'ing @Wauplin for visibility (and potential review)
src/transformers/utils/hub.py
Outdated
url = upload_file( | ||
path_or_fileobj=os.path.join(working_dir, file), | ||
path_in_repo=file, | ||
repo_id=repo_id, | ||
token=token, | ||
commit_message=commit_message, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not upload all files in one commit, using create_commit
(or upload_folder
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If using upload_folder
, you won't need to track modified files with their timestamps: if the files have the same name as others on the Hub and contain the same values, they won't get pushed (so if you have local checkpoints that have already been pushed it won't push them again, only verify if they're on the Hub).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments to use upload_folder
instead, it seems like it's an easy enough drop-in replacement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
upload_folder
is not ideal here as we may be in a folder where the user has some other stuff (like training artifacts) they didn't explicitly asked us to push. But create_commit
should work.
(I was asked to use upload_file
looks like my instructions were not correct 😝 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like you have to take your "instructions" from the right persons=)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🫣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your PR @sgugger!
It looks great to me, and uses a non-deprecated API which is great. I've added pointers for the switch from upload_file
to upload_folder
, which is a drop-in replacement.
I've tested that all tests pass locally, the only change that will need to be done is in the commit message attributed to the upload.
>>> pt_model.push_to_hub("my-awesome-model", organization="my-awesome-org") | ||
>>> pt_model.push_to_hub("my-awesome-org/my-awesome-model") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very welcome change
repo_id = f"{organization}/{repo_id}" | ||
|
||
token = HfFolder.get_token() if use_auth_token is True else use_auth_token | ||
url = create_repo(repo_id=repo_id, token=token, private=private, exist_ok=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this prepares transformers
for huggingface_hub version 0.10.0 (was previously using deprecated behavior)
src/transformers/utils/hub.py
Outdated
url = upload_file( | ||
path_or_fileobj=os.path.join(working_dir, file), | ||
path_in_repo=file, | ||
repo_id=repo_id, | ||
token=token, | ||
commit_message=commit_message, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/transformers/utils/hub.py
Outdated
url = upload_file( | ||
path_or_fileobj=os.path.join(working_dir, file), | ||
path_in_repo=file, | ||
repo_id=repo_id, | ||
token=token, | ||
commit_message=commit_message, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If using upload_folder
, you won't need to track modified files with their timestamps: if the files have the same name as others on the Hub and contain the same values, they won't get pushed (so if you have local checkpoints that have already been pushed it won't push them again, only verify if they're on the Hub).
src/transformers/utils/hub.py
Outdated
modified_files = [ | ||
f | ||
for f in os.listdir(working_dir) | ||
if f not in files_timestamps or os.path.getmtime(os.path.join(working_dir, f)) > files_timestamps[f] | ||
] | ||
for file in modified_files: | ||
if commit_message is None: | ||
commit_message = f"Upload {file}." | ||
logger.info(f"Uploading {file} to {repo_id}") | ||
url = upload_file( | ||
path_or_fileobj=os.path.join(working_dir, file), | ||
path_in_repo=file, | ||
repo_id=repo_id, | ||
token=token, | ||
commit_message=commit_message, | ||
) | ||
logger.info(f"{file} was uploaded, you can find it at {url}.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modified_files = [ | |
f | |
for f in os.listdir(working_dir) | |
if f not in files_timestamps or os.path.getmtime(os.path.join(working_dir, f)) > files_timestamps[f] | |
] | |
for file in modified_files: | |
if commit_message is None: | |
commit_message = f"Upload {file}." | |
logger.info(f"Uploading {file} to {repo_id}") | |
url = upload_file( | |
path_or_fileobj=os.path.join(working_dir, file), | |
path_in_repo=file, | |
repo_id=repo_id, | |
token=token, | |
commit_message=commit_message, | |
) | |
logger.info(f"{file} was uploaded, you can find it at {url}.") | |
url = upload_folder( | |
repo_id=repo_id, | |
folder_path=working_dir, | |
path_in_repo='.', | |
commit_message=commit_message, token=token | |
) | |
logger.info(f"New version was uploaded, you can find it at {url}.") |
src/transformers/utils/hub.py
Outdated
@@ -36,12 +36,13 @@ | |||
|
|||
import requests | |||
from filelock import FileLock | |||
from huggingface_hub import HfFolder, Repository, create_repo, list_repo_files, whoami | |||
from huggingface_hub import HfFolder, create_repo, list_repo_files, upload_file, whoami |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from huggingface_hub import HfFolder, create_repo, list_repo_files, upload_file, whoami | |
from huggingface_hub import HfFolder, create_repo, list_repo_files, upload_folder, whoami |
src/transformers/utils/hub.py
Outdated
url = upload_file( | ||
path_or_fileobj=os.path.join(working_dir, file), | ||
path_in_repo=file, | ||
repo_id=repo_id, | ||
token=token, | ||
commit_message=commit_message, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments to use upload_folder
instead, it seems like it's an easy enough drop-in replacement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! We'll likely adapt it for the Hub mixin in huggingface-hub.
LGTM!
Upload the model files to the 🤗 Model Hub while synchronizing a local clone of the repo in `repo_path_or_name`. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small detail: does it make sense to refer to repo_id
instead of repo_path_or_name
since the later seems to be deprecated ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I just realised it is a merged PR anyway 😬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can still address ;-)
* Rewrite push_to_hub to use upload_files * Adapt the doc a bit * Address review comments and clean doc
What does this PR do?
As asked in #18299 this PR rewrites the
PushToHubMixin
completely to useupload_file
behind the scenes. This is breaking since there is no more git repository involved, so while the user might have been left with a proper repository in the past, this is not the case anymore. The trade-off is that there is no need to clone anything now, so the upload should be faster.Another breaking change is to default
use_temp_files
to None now, which then defaults toTrue
if there is a local folder of the same name,False
otherwise. This felt way more natural with this change, but we can discuss and revisit if needed. For the rest, normally backward compatibility is ensured with proper deprecation warnings when needed.Using
push_to_hub=True
insave_pretrained
is adapted to this rewrite but with 0 breaking change normally.You can see in the tests how this simplifies a lot of things at the end of the day, so I'm quite happy with the final API.
Oh and one last change I made is to move the code specific to TensorFlow models in the TF modeling utils file.